1 Introduction
Obesity has become a major global health issue, with its prevalence nearly tripling since 1975. According to the WHO, 1.9 billion adults were overweight in 2016, of whom over 650 million were classified as obese. In Latin America and the Caribbean, the problem is particularly pressing: as of 2022, 25% of adults were obese, with rates reaching 36.1% in Mexico and around 28% and 23% in Peru and Colombia, respectively. These alarming trends contribute to rising cases of obesity-related diseases, such as diabetes and cardiovascular issues. This project uses data from Mexico, Peru, and Colombia (77% of which is synthetically generated via SMOTE and 23% collected online from 498 participants) to explore how lifestyle factors contribute to obesity in these regions. While the synthetic nature of the data limits real-world applicability, this scenario allows for the practical application of concepts from the “Data Science in Business Analytics” course, enabling us to identify key patterns in dietary habits and physical activity that contribute to obesity.
Our primary goal is to identify the most significant behavioral factors contributing to obesity in these countries by conducting exploratory analyses on lifestyle patterns and building a regression and a predictive model based on factors like diet, activity level, and demographics. Visualizations will also be developed to illustrate findings and relationships clearly, enhancing stakeholder understanding of the insights derived. Although synthetic data limits the findings’ applicability, this exercise provides valuable training in data analysis techniques and the potential insights obtainable from comprehensive, real-world data.
The main research questions this project addresses are which lifestyle factors significantly impact obesity in these regions and whether obesity can be predicted from these factors. The data focus on key lifestyle elements that influence obesity, namely diet and physical activity, in the cultural contexts of Mexico, Peru, and Colombia. Through these insights, we aim to inform public health initiatives, providing actionable data for healthcare organizations and policymakers to address the growing obesity crisis effectively.
2. Data
The data for this project come from a publicly available dataset titled “Estimation of Obesity Levels Based on Eating Habits and Physical Condition,” sourced from the UCI Machine Learning Repository.
This dataset, available in CSV format, was originally compiled by researchers at the Universidad de la Costa, Colombia, and includes both synthetically generated and user-collected data. About 23% of the data was collected through a web survey that was accessible online for 30 days, in which 498 individuals provided information regarding their dietary habits, physical activity levels, and demographic data. The remaining 77% of the dataset was generated synthetically using the SMOTE algorithm (Synthetic Minority Over-sampling Technique) in Weka. SMOTE was applied to balance the dataset, addressing class imbalance by generating synthetic examples for minority classes.
The resulting dataset contains 17 attributes and 2111 records.
There are limitations and challenges associated with using this data. First, the reliance on synthetic data means that the results may not accurately represent real-world scenarios, as it lacks the nuances and variability present in genuine human behaviors. Second, while the user-collected data can provide valuable insights, it may be subject to biases, such as self-reporting inaccuracies and sampling biases, which can impact the reliability of our findings. Additionally, gathering data from diverse geographical regions might pose challenges in reaching a representative sample, and we must ensure that the survey is accessible and engaging to participants to encourage participation.
Below, the steps related to the conducted analyses will be outlined.
  Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
1 Female  21   1.62   64.0                            yes   no    2   3
2 Female  21   1.52   56.0                            yes   no    3   3
3   Male  23   1.80   77.0                            yes   no    2   3
4   Male  27   1.80   87.0                             no   no    3   3
5   Male  22   1.78   89.8                             no   no    2   1
6   Male  29   1.62   53.0                             no  yes    2   3
       CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
1 Sometimes    no    2  no   0   1         no Public_Transportation
2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
4 Sometimes    no    2  no   2   0 Frequently               Walking
5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
6 Sometimes    no    2  no   0   0  Sometimes            Automobile
           NObeyesdad
1       Normal_Weight
2       Normal_Weight
3       Normal_Weight
4  Overweight_Level_I
5 Overweight_Level_II
6       Normal_Weight
All columns contain complete data, with no missing values. If missing data were present, we could address it either by removing rows with missing values using dataset_cleaned <- na.omit(dataset), or by imputing them, for instance by replacing missing “Age” values with the mean age (dataset$Age[is.na(dataset$Age)] <- mean(dataset$Age, na.rm = TRUE)).
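The two options mentioned above can be sketched on a small toy data frame. The data frame `toy` and its values are made up for illustration and are not taken from the project dataset.

```r
# Toy data frame (hypothetical values) with one missing Age
toy <- data.frame(
  Age    = c(21, NA, 23, 27),
  Weight = c(64, 56, 77, 87)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))

# Option 1: drop every row that contains at least one NA
toy_dropped <- na.omit(toy)

# Option 2: replace missing Age values with the mean of the observed ages
toy_imputed <- toy
toy_imputed$Age[is.na(toy_imputed$Age)] <- mean(toy_imputed$Age, na.rm = TRUE)
```

Dropping rows is simpler but loses the other columns of the affected records, while mean imputation keeps every row at the cost of slightly shrinking the variance of the imputed column.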
Converting categorical variables to factors now enables R to correctly interpret them as discrete categories, which is essential for accurate analysis and modeling.
# Convert gender to a factor (as.factor() keeps the existing labels as levels)
dataset$gender <- as.factor(dataset$gender)

# Convert the yes/no variables family_hist, caloric_food, smoke and calorie_check to factors
dataset$family_hist <- as.factor(dataset$family_hist)
dataset$caloric_food <- as.factor(dataset$caloric_food)
dataset$smoke <- as.factor(dataset$smoke)
dataset$calorie_check <- as.factor(dataset$calorie_check)

# Convert the other categorical variable to a factor as before
dataset$m_trans <- as.factor(dataset$m_trans)

# Convert obesity_lev, food_btw_meals and freq_alcohol to ordinal factors with the correct levels
dataset$obesity_lev <- factor(dataset$obesity_lev,
  levels = c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II",
             "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"),
  ordered = TRUE)
dataset$food_btw_meals <- factor(dataset$food_btw_meals,
  levels = c("No", "Sometimes", "Frequently", "Always"),
  ordered = TRUE)

# Standardize "no" to "No" so that it matches the level label and does not become NA
dataset$freq_alcohol[dataset$freq_alcohol == "no"] <- "No"

# Now convert it to an ordinal factor with the correct levels
dataset$freq_alcohol <- factor(dataset$freq_alcohol,
  levels = c("No", "Sometimes", "Frequently", "Always"),
  ordered = TRUE)
Using str() before and after confirms that each variable has the correct data type, preventing errors during analysis.
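One thing worth checking alongside str() is whether any labels were lost during the conversion. A minimal sketch with made-up responses (the vector `snack` is hypothetical) shows how a label that does not exactly match one of the declared levels silently becomes NA. Later in the report, the frequency table for food_btw_meals shows a count of 0 for “No”, which may be a symptom of this; that is an assumption, not something verified here.

```r
# Hypothetical responses with inconsistent capitalisation
snack <- c("no", "Sometimes", "Frequently", "Always", "no")
lvls  <- c("No", "Sometimes", "Frequently", "Always")

# Labels that are not listed in `levels` are silently converted to NA
naive <- factor(snack, levels = lvls, ordered = TRUE)
n_lost <- sum(is.na(naive))

# Standardise the label first, then build the ordered factor
snack_fixed <- snack
snack_fixed[snack_fixed == "no"] <- "No"
fixed <- factor(snack_fixed, levels = lvls, ordered = TRUE)

# Frequency table including any remaining NA values
table(fixed, useNA = "ifany")
```

Comparing sum(is.na(x)) before and after each factor() call is a cheap way to catch this kind of mismatch.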
After applying SMOTE, the distribution is noticeably more balanced across all categories, with each class showing a similar count. This outcome reflects SMOTE’s intended effect of addressing class imbalance.
2.3 Distribution analysis
Density plot for age.
Code
ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  ggtitle("Age Distribution by Obesity Levels")
Some peaks appear sharp and could indicate overfitting or unnatural clustering due to synthetic data (for example, a prominent peak in “Obesity Type I” around the age of 30). The categories appear to be well-separated, which may help the model learn patterns effectively, but it’s essential to ensure these separations are logical and not artifacts introduced by SMOTE.
The summary statistics show relatively consistent means and standard deviations for Age, Height, and Weight across obesity levels, which suggests that SMOTE has preserved the overall distribution without introducing extreme values. Interpretation: Since the means and standard deviations are similar across classes, it appears SMOTE didn’t drastically alter the dataset’s variability. This consistency supports the idea that SMOTE effectively balanced the classes without distorting key variable distributions.
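The code behind these per-class summary statistics is not shown in the report. A minimal sketch of how such a table could be produced with base R is given below; the data frame `toy` and its values are invented for illustration.

```r
# Toy data (hypothetical values) with two obesity levels
toy <- data.frame(
  obesity_lev = c("Normal_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"),
  age         = c(21, 23, 30, 32),
  weight      = c(60, 64, 95, 101)
)

# Mean and standard deviation of each numeric variable within each obesity level
group_means <- aggregate(cbind(age, weight) ~ obesity_lev, data = toy, FUN = mean)
group_sds   <- aggregate(cbind(age, weight) ~ obesity_lev, data = toy, FUN = sd)

group_means
group_sds
```

Placing the two tables side by side makes it easier to judge whether the means and spreads are in fact similar across classes.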
Perform K-means clustering and calculate silhouette score.
Silhouette Score from K-means Clustering: The mean silhouette score of approximately 0.456 suggests a moderate level of cohesion within clusters and some separation between them. This score indicates that the clusters (representing obesity levels) are neither too distinct nor too blended. Interpretation: A score close to 0.5 generally reflects reasonable class separability without excessive artificial separability. This score suggests that SMOTE has helped create distinguishable but not overly isolated clusters, which is desirable for class balance. We conclude that SMOTE has balanced the dataset without drastically distorting it.
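The clustering code itself is not shown in the report. The sketch below illustrates how a mean silhouette score can be obtained with kmeans() and cluster::silhouette(). The toy points, the number of clusters and the seed are all assumptions made for illustration; the features and settings used in the report may differ. Note that k-means finds clusters without using the class labels, so the clusters need not coincide with the obesity levels.

```r
library(cluster)  # provides silhouette()

set.seed(42)

# Two well-separated synthetic groups of 2-D points (toy data)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# K-means with two centres and several random starts
km <- kmeans(x, centers = 2, nstart = 10)

# Silhouette widths from the cluster assignment and the pairwise distances
sil <- silhouette(km$cluster, dist(x))
mean_sil <- mean(sil[, "sil_width"])
mean_sil
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster.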
Creating a Numerical Dataset “dataset_num”.
Code
# Recode the obesity level labels as numeric scores from 1 to 7
# (as.character() makes recode() return numbers even if obesity_lev is already a factor)
dataset_num <- dataset %>%
  mutate(obesity_lev = recode(as.character(obesity_lev),
    "Insufficient_Weight" = 1,
    "Normal_Weight" = 2,
    "Overweight_Level_I" = 3,
    "Overweight_Level_II" = 4,
    "Obesity_Type_I" = 5,
    "Obesity_Type_II" = 6,
    "Obesity_Type_III" = 7
  ))
str(dataset_num$obesity_lev)
Selection of possible factors influencing obesity level.
Code
# Create the heatmap with correlation values
# Assuming dataset_num is already defined and contains the relevant columns
cor_matrix <- cor(
  dataset_num %>%
    select("physical_act", "freq_alcohol", "obesity_lev", "age", "weight", "height",
           "family_hist", "caloric_food", "vegetable_food", "food_btw_meals", "use_tech",
           "ch2o", "m_trans", "smoke", "nb_meal_day", "calorie_check", "gender"),
  use = "complete.obs"
)

# Extract the correlations with 'obesity_lev'
cor_with_obesity_lev <- cor_matrix["obesity_lev", ]

# Order variables by their correlation with 'obesity_lev'
ordered_vars <- names(sort(cor_with_obesity_lev, decreasing = TRUE))

# Reorder the correlation matrix based on this order
cor_matrix_ordered <- cor_matrix[ordered_vars, ordered_vars]

# Melt the ordered correlation matrix into long format
cor_long <- melt(cor_matrix_ordered)

ggplot(cor_long, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  # Center text within tiles
  geom_text(aes(label = round(value, 2)), color = "black", size = 2.5, vjust = 0.5, hjust = 0.5) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap Ordered by Obesity Level", x = "Variables", y = "Variables") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),  # Rotate x-axis labels for readability
    axis.text.y = element_text(angle = 45, vjust = 1)   # Rotate y-axis labels for readability
  )
3. Exploratory Data Analysis (EDA)
3.1 Descriptive statistics and distribution analysis
3.1.1 Age
Descriptive statistics for Age.
Code
summary(dataset$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  14.00   19.92   22.85   24.35   26.00   61.00
Code
sd(dataset$age, na.rm =TRUE)
[1] 6.368801
Age distribution.
Code
ggplot(dataset, aes(x = age)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Age Distribution", x = "Age", y = "Count") +
  theme_minimal()
The average age of participants is 24.35 years, with a standard deviation of 6.37 years, and ages are predominantly concentrated in the 20-30 range. This distribution suggests a young population, which may limit the generalizability of results to older age groups, where risk factors for overweight and obesity may differ.
Age by Obesity Level
Distribution of Age by Obesity Level.
Code
ggplot(dataset, aes(x = age, fill = obesity_lev)) +
  geom_histogram(bins = 20, color = "black", alpha = 0.6) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Distribution of Age by Obesity Level") +
  labs(x = "Age", y = "Count") +
  theme_minimal()
The age distribution varies across obesity levels, with younger ages being more prevalent in lower obesity levels (like Insufficient Weight and Normal Weight), while higher obesity levels tend to have older individuals. This suggests a possible trend where age might correlate with obesity level, especially in higher obesity categories.
Age Distribution by Obesity Level (Violin Plot).
Code
ggplot(dataset, aes(x = obesity_lev, y = age, fill = obesity_lev)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  labs(title = "Age Distribution by Obesity Level", x = "Obesity Level", y = "Age") +
  theme_minimal()
The violin plot shows, more clearly, the spread and density of ages across obesity levels. Younger ages dominate in the lower obesity levels, while there is a wider age range with a higher density around 30–40 years in the higher obesity levels. This pattern supports the idea that older age groups are more likely to fall into higher obesity categories.
Trend of Obesity Level with Age (LOESS-smoothed trend line).
Code
ggplot(dataset, aes(x = age, y = as.numeric(obesity_lev))) +
  geom_jitter(alpha = 0.3) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  labs(title = "Trend of Obesity Level with Age", x = "Age", y = "Obesity Level") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
The trend line suggests an increase in obesity level with age until around 30–35, followed by a decrease. This implies that middle age might be a peak period for higher obesity levels, and there may be a trend of reducing obesity levels at older ages.
3.1.2 Height
Descriptive statistics for Height.
Code
summary(dataset$height)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.450   1.630   1.702   1.703   1.769   1.980
Code
sd(dataset$height, na.rm =TRUE)
[1] 0.09318594
Height distribution.
Code
ggplot(dataset, aes(x = height)) +
  geom_histogram(bins = 20, fill = "purple", color = "black", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Height (m)", y = "Count") +
  theme_minimal()
Height by gender
Density plot for height distribution by gender.
Code
ggplot(dataset, aes(x = height, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Height by Gender", x = "Height", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()
The average height is 1.70 m (SD = 0.09 m), with a notable difference between genders, where males tend to have a higher median height.
Height by Obesity Level
Box Plot of Height by Obesity Level.
Code
ggplot(dataset, aes(x = obesity_lev, y = height, fill = obesity_lev)) +
  geom_boxplot(alpha = 0.6) +
  labs(title = "Height Distribution by Obesity Level", x = "Obesity Level", y = "Height") +
  theme_minimal() +
  theme(legend.position = "none", axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot indicates that individuals with lower obesity levels (e.g., Insufficient Weight, Normal Weight) tend to have a more consistent height range, while those in higher obesity levels (like Obesity Type II and III) show more variability in height. This suggests that weight may be more influential than height alone in determining obesity level.
3.1.3 Weight
Descriptive statistics for Weight.
Code
summary(dataset$weight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  39.00   66.00   83.10   86.86  108.02  173.00
Code
sd(dataset$weight, na.rm =TRUE)
[1] 26.19085
Weight by gender
Density plot for weight distribution by gender.
Code
ggplot(dataset, aes(x = weight, fill = gender)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Weight by Gender", x = "Weight", y = "Density") +
  scale_fill_manual(values = c("pink", "lightblue"), name = "Gender", labels = c("Female", "Male")) +
  theme_minimal()
The density plot reveals distinct weight distributions between genders. Females generally weigh less, with a peak around 60-70 kg, while males peak around 90-100 kg and 110-120 kg, indicating a tendency toward higher weights. The overlapping region around 80-90 kg shows weights common to both genders, though males dominate at higher ranges. Weight ranges from 39 to 173 kg, with a mean of 86.9 kg, a median of 83.1 kg, and a standard deviation of 26.2 kg, indicating a moderate spread.
Weight by obesity level
Ridgeline Plot of Weight by Obesity Level.
Code
ggplot(dataset, aes(x = weight, y = obesity_lev, fill = obesity_lev)) +
  geom_density_ridges(scale = 0.9, alpha = 0.6) +
  labs(title = "Ridgeline Plot of Weight by Obesity Level", x = "Weight", y = "Obesity Level") +
  theme_minimal() +
  theme(legend.position = "none")
Picking joint bandwidth of 2.63
This ridgeline plot shows a clear progression in weight distribution across obesity levels. As the obesity level increases, the weight distribution shifts progressively to higher ranges. The “Normal Weight” and “Insufficient Weight” categories are concentrated at lower weights, while the higher obesity types (I, II, and III) peak at significantly greater weights, indicating a strong positive association between weight and obesity level. Overall, the weight distribution has a mean of 86.9 kg and a standard deviation of 26.2 kg.
3.1.4 Height and Weight
Scatter Plot (height vs weight), colored by obesity level.
Code
ggplot(dataset, aes(x = height, y = weight, color = obesity_lev)) +
  geom_point(alpha = 0.7) +
  # Adds a trend line for each obesity level
  geom_smooth(method = "lm", se = FALSE, aes(group = obesity_lev)) +
  ggtitle("Scatter Plot of Weight vs Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level")
`geom_smooth()` using formula = 'y ~ x'
Facet Grid for Height and Weight by Obesity Level.
Code
ggplot(dataset, aes(x = height, y = weight)) +
  geom_point(alpha = 0.7, aes(color = obesity_lev)) +
  facet_wrap(~ obesity_lev) +
  ggtitle("Facet Grid of Weight and Height by Obesity Level") +
  theme_minimal() +
  labs(x = "Height", y = "Weight", color = "Obesity Level") +
  theme(legend.position = "none")
The scatter plot with trend lines for each obesity level reveals a clear positive correlation between weight and height across all obesity levels. As the obesity level increases, the slope generally becomes steeper, indicating a stronger weight gain relative to height. We created the facet grid to show these trends more clearly. The “Obesity_Type_III” (yellow) category has the steepest slope, suggesting a significant weight increase per unit of height, which is consistent with the highest level of obesity.
Correlation between height and weight.
Code
correlation_height_weight <- cor(dataset$height, dataset$weight, use = "complete.obs")
correlation_height_weight
[1] 0.457468
The correlation observed between height and weight (r = 0.457) aligns with existing literature, confirming the expected positive relationship between these variables.
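For readers who want to see what cor() computes, the Pearson coefficient can be reproduced by hand. The sketch below uses four height and weight pairs copied from the data preview in Section 2, so the resulting value is only illustrative and does not match the full-sample correlation.

```r
# Four (height, weight) pairs taken from the first rows of the data preview
height <- c(1.52, 1.62, 1.78, 1.80)
weight <- c(56.0, 64.0, 89.8, 87.0)

# Pearson correlation: co-variation divided by the product of the spreads
dh <- height - mean(height)
dw <- weight - mean(weight)
r_manual <- sum(dh * dw) / sqrt(sum(dh^2) * sum(dw^2))

# Built-in equivalent
r_builtin <- cor(height, weight)

c(manual = r_manual, builtin = r_builtin)
```

Both values agree up to floating-point precision, which confirms that cor() returns the Pearson coefficient by default.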
3.1.5 Function for EDA on categorical variables
Function for EDA on categorical variables, with separate distributions for gender and obesity levels.
Code
eda_categorical <- function(data, variable, gender_var = "gender", obesity_var = "obesity_lev") {
  cat("\n--- EDA for Categorical Variable:", variable, "---\n")

  # Frequency and proportion tables
  cat("\nFrequency Table:\n")
  freq_table <- table(data[[variable]])
  print(freq_table)
  cat("\nProportion Table (Rounded):\n")
  prop_table <- round(prop.table(freq_table), 2)
  print(prop_table)

  # Bar chart with counts (overall distribution)
  p1 <- ggplot(data, aes(x = .data[[variable]])) +
    geom_bar(fill = "skyblue", color = "black") +
    ggtitle(paste("Distribution of", variable, "- Counts")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(y = "Count")
  print(p1)

  # Distribution by gender
  p2 <- ggplot(data, aes(x = .data[[variable]], fill = .data[[gender_var]])) +
    geom_bar(position = "dodge", color = "black") +
    ggtitle(paste("Distribution of", variable, "by Gender")) +
    labs(fill = "Gender") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(y = "Count")
  print(p2)

  # Dodged bar chart of the categorical variable by obesity level
  p3 <- ggplot(data, aes(x = .data[[variable]], fill = .data[[obesity_var]])) +
    geom_bar(position = "dodge", color = "black") +
    ggtitle(paste("Dodged Bar Chart of", variable, "by", obesity_var)) +
    labs(y = "Count", fill = obesity_var) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p3)

  # Stacked bar chart of the variable by obesity level (proportions within each obesity level)
  p4 <- ggplot(data, aes(x = .data[[obesity_var]], fill = .data[[variable]])) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    ggtitle(paste("Stacked Bar Chart of", variable, "by", obesity_var,
                  "- Proportions within each Obesity Level")) +
    labs(x = obesity_var, y = "Proportion", fill = variable) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p4)
}

# Food between meals (snacking)
eda_categorical(dataset, "food_btw_meals", gender_var = "gender", obesity_var = "obesity_lev")
--- EDA for Categorical Variable: food_btw_meals ---
Frequency Table:
        No  Sometimes Frequently     Always
         0       1761        236         53
Proportion Table (Rounded):
        No  Sometimes Frequently     Always
      0.00       0.86       0.12       0.03
Overall distribution - Most respondents (86%) answered “Sometimes”. Very few selected “Always” (3%), and the table shows a count of 0 for “No”.
Distribution by Gender - it appears that a slightly higher proportion of males selected “Sometimes” compared to females.
Distribution by Obesity Level - As obesity levels increase, there is a noticeable shift towards “Sometimes” and “Frequently” responses, particularly in higher obesity levels (Type I, II, and III).
Individuals with “Insufficient Weight” and “Normal Weight” show a relatively balanced spread between “no,” “Sometimes,” and “Frequently.”
These observations suggest a potential correlation between the frequency of eating between meals and obesity levels. Individuals who report eating between meals “Sometimes” or “Frequently” seem more likely to have higher obesity levels.
In further work, we could explore interactions between snacking and other variables such as physical activity and caloric_food: people who engage in regular physical activity often have increased caloric needs and may snack more frequently. Water intake (ch2o) is another candidate: people who drink less water may have more cravings or perceive thirst as hunger, leading them to snack more.
--- EDA for Categorical Variable: caloric_food ---
Frequency Table:
  no  yes
 243 1844
Proportion Table (Rounded):
  no  yes
0.12 0.88
Overall distribution - The majority of individuals prefer caloric food, as indicated by the high count of “yes” responses.
Distribution by Gender - Both males and females show a similar pattern, with a strong preference for caloric food (“yes”), although the preference is slightly higher in males.
Distribution by Obesity Level - As obesity levels increase, the preference for caloric food (“yes”) also increases, particularly in higher obesity categories (Obesity Types I, II, and III).
Proportion Analysis by Obesity Level - Individuals with higher obesity levels are overwhelmingly more likely to prefer caloric food, with nearly 100% of those in Obesity Types II and III choosing “yes”.
Count participants who answered “yes” for frequent high-calorie food consumption.
A notable 88.4% of participants report frequent consumption of high-calorie foods, which may contribute to weight gain. This trend highlights the need for dietary interventions focused on reducing high-calorie intake.
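The 88.4% figure follows directly from the frequency table above. As a quick check, using only the two counts reported there:

```r
# Counts copied from the caloric_food frequency table
counts <- c(no = 243, yes = 1844)

# Share of "yes" answers, in percent, rounded to one decimal place
pct_yes <- round(100 * counts[["yes"]] / sum(counts), 1)
pct_yes
```

On the full data frame, the same share could be obtained with mean(dataset$caloric_food == "yes"), assuming the column contains no missing values.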
--- EDA for Categorical Variable: freq_alcohol ---
Frequency Table:
        No  Sometimes Frequently     Always
       636       1380         70          1
Proportion Table (Rounded):
        No  Sometimes Frequently     Always
      0.30       0.66       0.03       0.00
Overall distribution - The majority of individuals report drinking alcohol “Sometimes” (66%) or “No” (30%), with very few indicating “Frequently” or “Always”.
Distribution by Gender - Both males and females primarily report “Sometimes” or “No,” with females showing a slightly higher preference for “Sometimes”.
Distribution by Obesity Level - In lower obesity levels (Insufficient and Normal Weight), a significant proportion reports “No” alcohol consumption. In higher obesity levels, particularly Obesity_Type_I and above, there is an increased tendency to drink “Sometimes,” while “No” responses decline.
Line plot to show the pattern.
Code
# Prepare the data summary for 'Sometimes' and 'No' responses
data_summary <- dataset %>%
  filter(freq_alcohol %in% c("Sometimes", "No")) %>%
  group_by(obesity_lev, freq_alcohol) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(total = sum(count), proportion = count / total) %>%
  ungroup()

# Check if data_summary has the expected columns
print(head(data_summary))
# A tibble: 6 × 5
  obesity_lev         freq_alcohol count total proportion
  <ord>               <ord>        <int> <int>      <dbl>
1 Insufficient_Weight No             117   266      0.440
2 Insufficient_Weight Sometimes      149   266      0.560
3 Normal_Weight       No             104   263      0.395
4 Normal_Weight       Sometimes      159   263      0.605
5 Overweight_Level_I  No              50   260      0.192
6 Overweight_Level_I  Sometimes      210   260      0.808
Code
ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = freq_alcohol, color = freq_alcohol)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  ggtitle("Proportion of 'Sometimes' and 'No' Responses for Alcohol Frequency by Obesity Level") +
  labs(x = "Obesity Level", y = "Proportion", color = "Alcohol Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The proportion of individuals who drink alcohol “Sometimes” increases with higher obesity levels, peaking in Obesity_Type_III. In contrast, the likelihood of abstaining from alcohol (“no”) decreases as obesity levels rise. This pattern suggests that moderate alcohol consumption may be associated with higher obesity levels, while abstention is more common among those with lower obesity levels.
A possible interaction to investigate later is between alcohol frequency and caloric food preference, as both behaviors seem linked to higher obesity levels. Exploring this could reveal if individuals with a preference for caloric foods and moderate alcohol consumption have a compounding effect on obesity risk. This investigation could help clarify whether combined lifestyle factors contribute more significantly to higher obesity levels than each factor alone.
--- EDA for Categorical Variable: calorie_check ---
Frequency Table:
  no  yes
1991   96
Proportion Table (Rounded):
  no  yes
0.95 0.05
The vast majority of individuals do not check calories.
Distribution by Gender - Only a small percentage of each gender reports checking calories.
Distribution by Obesity Level - There is a slight increase in calorie-checking behavior among individuals with lower obesity levels (Insufficient and Normal Weight).
Proportion Analysis by Obesity Level - Calorie-checking behavior slightly decreases as obesity levels increase.
Code
# Share of "no" and "yes" answers within each obesity level
# (grouping by obesity_lev before mutate() makes the proportions sum to 1 per level)
data_summary <- dataset %>%
  group_by(obesity_lev, calorie_check) %>%
  summarise(count = n(), .groups = "drop") %>%
  group_by(obesity_lev) %>%
  mutate(total = sum(count), proportion = count / total) %>%
  ungroup()

ggplot(data_summary, aes(x = obesity_lev, y = proportion, group = calorie_check, color = calorie_check)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces the deprecated size argument for lines
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent) +
  scale_color_manual(values = c("no" = "lightcoral", "yes" = "lightblue")) +
  labs(title = "Proportion of Calorie Checking by Obesity Level", x = "Obesity Level",
       y = "Proportion", color = "Calorie Check") +
  theme_minimal()
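When reading a plot of proportions by obesity level, it matters whether each proportion is a share of the whole sample or a share within its obesity level. The sketch below contrasts the two on a made-up data frame (`toy`, hypothetical values) using base R.

```r
# Toy data (hypothetical): 4 people in one level, 2 in another
toy <- data.frame(
  obesity_lev   = c("Normal", "Normal", "Normal", "Normal", "Obese", "Obese"),
  calorie_check = c("yes", "no", "no", "no", "no", "no")
)

counts <- table(toy$obesity_lev, toy$calorie_check)

# Shares of the whole sample: all cells together sum to 1
overall <- prop.table(counts)

# Shares within each obesity level: each row sums to 1
within_level <- prop.table(counts, margin = 1)

overall["Normal", "yes"]       # 1 of 6 respondents
within_level["Normal", "yes"]  # 1 of 4 respondents in the "Normal" level
```

Within-level shares are usually the better choice for comparing behaviour across classes, because they are not affected by differences in class size.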
3.1.8 Vegetable consumption
Distribution of vegetable consumption frequency (vegetable_food): initially, a bar chart was created. This allowed us to clearly identify the presence of non-integer data. Values are rounded to the nearest integer, and the categories are renamed to make the informational content immediately understandable.
Code
dataset$vegetable_food <- round(dataset$vegetable_food)

ggplot(dataset, aes(x = factor(vegetable_food, levels = c(1, 2, 3),
                               labels = c("Rarely", "Sometimes", "Often")))) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Vegetable Consumption Frequency", x = "Frequency of Vegetable Consumption", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5))
Vegetable intake is generally moderate, with a mean score of 2.4 on a 1-to-3 scale (1 = “Rarely”, 3 = “Often”).
However, a detailed distribution of consumption and an analysis of associations with other risk factors are necessary to make informed statements about the quality of dietary habits and potential health risks.
3.1.9 Number of meals per day
The same issues as with the previous variable were initially encountered in the graphical representation. Similarly, the values were rounded and the labels were renamed.
Code
ggplot(dataset, aes(x = factor(round(nb_meal_day), levels = c(1, 2, 3, 4, 5),
                               labels = c("1 Meal", "2 Meals", "3 Meals", "4 Meals", "5+ Meals")))) +
  geom_bar(fill = "orange", color = "black") +
  labs(title = "Number of Meals per Day", x = "Meals per Day", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5))
Most participants consume three meals daily, with a mean of 2.7 and a standard deviation of 0.5.
Calculate the correlation between the number of meals per day and weight. Select the variables for correlation: number of meals per day and weight.
Code
correlation_meals_weight <- cor(dataset$nb_meal_day, dataset$weight, use = "complete.obs")
correlation_meals_weight
[1] 0.09214947
The correlation between the number of meals per day and weight is weak (r ≈ 0.09). This suggests that meal frequency alone may not have a direct impact on weight; other factors, such as meal quality and portion sizes, could play a more significant role.
3.1.10 Physical activity
Plot histogram and density.
Code
ggplot(dataset, aes(x = physical_act)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  # linewidth replaces the deprecated size argument for lines
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Physical Activity") +
  theme_minimal() +
  labs(x = "Physical Activity", y = "Density")
The histogram and density plot reveal that physical activity levels have distinct peaks at 0, 1, 2, and 3, suggesting that these values are common reported levels. Intermediate values, likely due to synthetic data or SMOTE, are also present but less frequent.
Violin plot by category.
Code
# Replace 'obesity_lev' with any category variable
ggplot(dataset, aes(x = obesity_lev, y = physical_act, fill = obesity_lev)) +
  geom_violin(trim = FALSE) +
  geom_boxplot(width = 0.1, color = "black", fill = "white") +
  ggtitle("Violin Plot of Physical Activity by Obesity Level") +
  theme_minimal() +
  labs(x = "Obesity Level", y = "Physical Activity") +
  theme(legend.position = "none")
Individuals with lower obesity levels (e.g., Insufficient and Normal Weight) have a wider spread of physical activity levels, often skewed toward higher activity. Higher obesity levels show a tendency toward lower physical activity.
Scatter Plot with Age.
Code
# Replace 'age' with another continuous variable
ggplot(dataset, aes(x = age, y = physical_act)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  ggtitle("Scatter Plot of Physical Activity vs Age") +
  theme_minimal() +
  labs(x = "Age", y = "Physical Activity")
`geom_smooth()` using formula = 'y ~ x'
There appears to be a negative trend between age and physical activity levels, with younger individuals tending to report higher physical activity levels.
1.0.2.1.11 Water consumption
Plot histogram and density for water consumption.
Code
ggplot(dataset, aes(x = ch2o)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "skyblue", color = "black", alpha = 0.6) +
  geom_density(color = "darkblue", linewidth = 1) +
  ggtitle("Histogram and Density of Consumption of Water") +
  theme_minimal() +
  labs(x = "CH2O", y = "Density")
This histogram and density plot of daily water consumption (CH2O) shows a clear peak at 2, indicating that most individuals report this consumption level. Read as roughly 2 liters per day, this is in line with commonly cited recommendations for daily water intake.
Facet Plot (CH2O by Obesity Level and Gender).
Code
ggplot(dataset, aes(x = ch2o, fill = gender)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "dodge") +
  facet_wrap(~ obesity_lev) +
  ggtitle("Histogram of CH2O by Obesity Level and Gender") +
  theme_minimal() +
  labs(x = "CH2O", y = "Count", fill = "Gender")
The facets reveal an unusual pattern: Obesity Type II predominantly includes males, while Obesity Type III mostly includes females, highlighting a gender disparity in the composition of the higher obesity levels.
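The gender split within each obesity level can also be read from a cross-tabulation rather than from the facets alone. The following is a minimal sketch on simulated data: the data frame `toy`, the helper `make_group()`, and the group proportions are hypothetical and chosen only to mimic the pattern described above.

```r
# Hypothetical stand-in for `dataset`: simulated, not the project data.
set.seed(7)
make_group <- function(level, n, p_male) {
  data.frame(
    obesity_lev = level,
    gender = sample(c("Female", "Male"), n, replace = TRUE,
                    prob = c(1 - p_male, p_male))
  )
}
toy <- rbind(
  make_group("Normal_Weight",    100, 0.50),
  make_group("Obesity_Type_II",  100, 0.95),
  make_group("Obesity_Type_III", 100, 0.05)
)

# Counts per obesity level and gender, then row-wise proportions.
counts    <- table(toy$obesity_lev, toy$gender)
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

Each row of `row_share` sums to 1, so a value close to 0 or 1 flags a level that is dominated by one gender.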
Scatter Plot (ch2o vs Age).
Code
ggplot(dataset, aes(x = age, y = ch2o)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  ggtitle("Scatter Plot of CH2O vs Age with Trendline") +
  theme_minimal() +
  labs(x = "Age", y = "CH2O")
`geom_smooth()` using formula = 'y ~ x'
The slight downward trend suggests a minor decrease in water consumption with age, though variability remains high across all ages.
Violin Plot by Gender.
Code
ggplot(dataset, aes(x = gender, y = ch2o, fill = gender)) +
  geom_violin(trim = FALSE, alpha = 0.7) +
  ggtitle("Violin Plot of CH2O by Gender") +
  theme_minimal() +
  labs(x = "Gender", y = "CH2O") +
  theme(legend.position = "none")
Males show more variability in water consumption compared to females, with similar median values around 2.
The water consumption variable (CH2O) shows a predominant consumption level around 2 across different obesity levels, with minor variations by gender and obesity category. Originally, CH2O values were discrete (1.0, 2.0, 3.0), suggesting categorical consumption levels. However, after applying the SMOTE algorithm to address class imbalance, interpolated values emerged, resulting in a more continuous distribution. This variable’s distribution across obesity levels, along with gender differences at extreme obesity levels (e.g., more males in Obesity Type II and females in Obesity Type III), indicates that CH2O could be a valuable predictor for obesity, capturing both consumption habits and subtle demographic patterns.
1.0.2.1.12 Technology utilization
Histogram with Density.
Code
ggplot(dataset, aes(x = use_tech)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black", alpha = 0.6) +
  geom_density(color = "blue", linewidth = 1) +
  labs(title = "Histogram and Density of Use of Technology", x = "Use of Technology", y = "Density") +
  theme_minimal()
The histogram shows a strong concentration at discrete values (0, 1, and 2), likely reflecting the original categorical nature of the data before SMOTE. This also aligns with observed density peaks.
Density of Use of Technology by Obesity Level.
Code
ggplot(dataset, aes(x = use_tech, fill = obesity_lev)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Use of Technology by Obesity Level", x = "Use of Technology", y = "Density") +
  theme_minimal()
The density plot reveals distinct peaks in technology usage across obesity levels, with some levels like Obesity_Type_III having higher peaks, suggesting varied levels of technology use in these categories.
Boxplot by Obesity Level.
Code
ggplot(dataset, aes(x = obesity_lev, y = use_tech, fill = obesity_lev)) +
  geom_boxplot() +
  labs(title = "Boxplot of Use of Technology by Obesity Level", x = "Obesity Level", y = "Use of Technology") +
  theme_minimal() +
  theme(legend.position = "none")
Technology use shows some variation across obesity levels, with certain categories (like Obesity_Type_I and Obesity_Type_III) showing higher median usage compared to others.
Scatter Plot with Age.
Code
ggplot(dataset, aes(x = age, y = use_tech)) +
  geom_point(alpha = 0.4, color = "blue") +
  geom_smooth(method = "lm", color = "red", linetype = "dashed") +
  labs(title = "Scatter Plot of Use of Technology vs Age", x = "Age", y = "Use of Technology") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
There is a noticeable negative correlation between age and technology use, indicating younger individuals tend to use technology more than older ones.
The ‘Use of Technology’ variable shows a clear trend where younger individuals use technology more frequently, as seen in the negative correlation with age. Differences in technology usage across obesity levels suggest it might have predictive value in distinguishing between levels, although the SMOTE algorithm has introduced interpolated values that blur strict categories. This variable could therefore help predict obesity levels, especially if technology usage patterns are indicative of lifestyle factors associated with obesity.
1.0.3 4. Analysis overview of statistical methods and model selection
In this analysis, we plan to employ a regression modeling approach to describe the relationships between a set of independent variables (the predictors) and a single dependent variable, Obesity Level (the outcome).
We chose regression modeling because it is a well-established statistical method for identifying and quantifying relationships among variables. This matters given our overall objective: to forecast outcomes and to evaluate how changes in the predictor variables affect the target variable.
Based on our exploratory data analysis, indications of potential outliers emerged within our dataset. However, upon closer examination, these values represent extreme data points that remain plausible given the context of our study. Consequently, our approach involves building two regression models: one that includes these extreme values and one that excludes them. The objective is to examine the impact of these extreme data points on the predictive performance of the model, analyzing how their presence or absence influences the resulting predictions and model behavior.
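The comparison between a model with and a model without extreme values could look like the sketch below. Several things here are assumptions rather than the project's method: the data frame `toy` is simulated, the numeric column `outcome` is a hypothetical stand-in for obesity level, and the 1.5 × IQR rule is only one possible way to flag extreme values.

```r
# Hypothetical stand-in for `dataset`: simulated, not the project data.
set.seed(123)
n <- 300
toy <- data.frame(age = runif(n, 15, 60), physical_act = runif(n, 0, 3))
toy$outcome <- 25 + 0.1 * toy$age - 1.5 * toy$physical_act + rnorm(n, sd = 3)
toy$outcome[1:10] <- toy$outcome[1:10] + 20  # inject a few extreme values

# Flag extreme values with the 1.5 * IQR rule (an assumed choice).
q   <- unname(quantile(toy$outcome, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_extreme <- toy$outcome < q[1] - 1.5 * iqr | toy$outcome > q[2] + 1.5 * iqr

# Fit the same model on the full data and on the trimmed data.
fit_all     <- lm(outcome ~ age + physical_act, data = toy)
fit_trimmed <- lm(outcome ~ age + physical_act, data = toy[!is_extreme, ])

# Compare coefficients and residual standard error side by side.
comparison <- rbind(all = coef(fit_all), trimmed = coef(fit_trimmed))
print(round(comparison, 3))
cat("Flagged rows:", sum(is_extreme),
    "| residual SD (all / trimmed):",
    round(sigma(fit_all), 3), "/", round(sigma(fit_trimmed), 3), "\n")
```

If the coefficients barely move between the two rows, the extreme values have little influence on the fit; large shifts would indicate that conclusions depend on a handful of observations.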
The rationale for adopting regression analysis is twofold. First, it gives a nuanced understanding of the extent to which each predictor influences the outcome. Second, it provides a set of statistical metrics for evaluating how well the model explains the variance in the data. Through regression analysis, we can determine whether statistically significant relationships exist between the variables under study and quantify the magnitude and direction of these associations. The method yields coefficients that reflect the expected change in the dependent variable for a one-unit change in a predictor, holding the other variables constant. That answers the first part of our research question. On top of that, the regression model will let us determine how much of the variability in the dependent variable our independent variables account for (assess predictive power), as well as measure the individual effect of each predictor and test its statistical significance (quantify effects).
Finally, by including relevant covariates and control variables in the model, we aim to reduce bias and isolate the influence of the primary predictors on the outcome, thereby improving the accuracy of our findings. By looking at R², the p-values, and the standardized coefficients, we should be able to understand which key factors influence a person's weight condition (obesity level).
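As an illustration of how R², p-values, and standardized coefficients can be read off a fitted model, here is a minimal sketch. It assumes a numeric outcome: `toy` is simulated, and its `outcome` column is a hypothetical numeric stand-in for obesity level, since the encoding of the outcome is not specified here.

```r
# Hypothetical stand-in for `dataset`: simulated, not the project data.
set.seed(2024)
n <- 400
toy <- data.frame(
  age          = runif(n, 15, 60),
  physical_act = runif(n, 0, 3),
  use_tech     = runif(n, 0, 2)
)
# use_tech has no true effect in this simulation.
toy$outcome <- 20 + 0.1 * toy$age - 3 * toy$physical_act + rnorm(n, sd = 3)

# Standardize every column so coefficients are comparable across predictors.
toy_std <- as.data.frame(scale(toy))
fit_std <- lm(outcome ~ age + physical_act + use_tech, data = toy_std)

coef_table <- summary(fit_std)$coefficients  # estimates, std. errors, t, p-values
r_squared  <- summary(fit_std)$r.squared

# Rank predictors by the absolute size of their standardized coefficient.
ranked <- sort(abs(coef(fit_std)[-1]), decreasing = TRUE)

print(round(coef_table, 4))
cat("R^2 =", round(r_squared, 3), "\n")
print(round(ranked, 3))
```

A standardized coefficient is expressed in standard deviations of the outcome per standard deviation of the predictor, which is what makes the ranking across differently scaled predictors meaningful.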
To ensure the validity of the model, we will need to check the linearity of the relationship between the predictors and the outcome, the normality of the residuals, and the homogeneity of variance. Lastly, we will check the predictors for multicollinearity. The Variance Inflation Factor (VIF) is a statistical measure used to detect multicollinearity in a regression model. Multicollinearity occurs when two or more independent variables in a model are highly correlated, meaning they contain redundant information. High multicollinearity can distort the estimates of coefficients, making it difficult to interpret the individual effect of each predictor.
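The VIF of a predictor equals 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated predictors; `x1`, `x2`, and `x3` are hypothetical names, with `x2` deliberately built to be highly correlated with `x1`. In practice a ready-made function such as `car::vif()` is commonly used instead.

```r
# Hypothetical predictors: simulated, not the project data.
set.seed(99)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # strongly correlated with x1
x3 <- rnorm(n)                       # independent of the others
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux    <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF close to 1 indicates little redundancy; values above roughly 5 to 10 are often treated as a warning sign, although these cut-offs are rules of thumb rather than strict thresholds.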
We also want to build a predictive model. The EDA and the regression model will likely show that some of the key factors in our dataset are useful for predicting a person's weight category. Once we have identified relationships within our data, we aim to make reliable predictions about future outcomes. The regression will also help us understand which variables have the most significant impact on obesity level. Using additional metrics, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R², we will assess how well the model performs and refine it as needed to improve accuracy.
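The three metrics can be computed on a held-out test set as sketched below. This is an illustration under stated assumptions: `toy` is simulated, `outcome` is a hypothetical numeric stand-in for obesity level, and the 80/20 split is an arbitrary choice.

```r
# Hypothetical stand-in for `dataset`: simulated, not the project data.
set.seed(10)
n <- 500
toy <- data.frame(age = runif(n, 15, 60), physical_act = runif(n, 0, 3))
toy$outcome <- 25 + 0.1 * toy$age - 2 * toy$physical_act + rnorm(n, sd = 2)

# 80/20 train/test split (assumed proportion).
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit  <- lm(outcome ~ age + physical_act, data = train)
pred <- predict(fit, newdata = test)
err  <- test$outcome - pred

mae  <- mean(abs(err))    # Mean Absolute Error
rmse <- sqrt(mean(err^2)) # Root Mean Square Error
r2   <- 1 - sum(err^2) / sum((test$outcome - mean(test$outcome))^2)  # out-of-sample R^2

cat("MAE =", round(mae, 3), "| RMSE =", round(rmse, 3), "| R^2 =", round(r2, 3), "\n")
```

RMSE is never smaller than MAE and penalizes large errors more heavily, so a wide gap between the two hints at a few poorly predicted observations.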
1.0.4 5. Conclusion
So far, we have conducted a comprehensive exploration and preparation of our dataset, focusing on understanding the influence of lifestyle factors on obesity within a sample from Mexico, Peru, and Colombia. The dataset, which was pre-processed with SMOTE to address class imbalance, has provided us with balanced obesity categories, facilitating an in-depth analysis of key variables such as eating habits, physical activity, and alcohol consumption. Through correlation analysis, we identified the variables with the strongest associations to obesity levels, helping to guide our selection of factors for inclusion in the next modeling phase. Additionally, we have thoroughly cleaned and structured the data, renaming variables for clarity, formatting categorical variables, and removing duplicates to ensure a solid foundation for robust modeling.
The next steps involve constructing regression models to analyze the relationships and predictive power of these selected factors on obesity levels. Specifically, we will develop two versions of the model—one that includes extreme values and one that excludes them—to evaluate the impact of outliers on model accuracy and stability. Key metrics such as R², P-values, and VIF will be used to confirm the reliability of the model and address potential multicollinearity issues. Following this, we will build and fine-tune a predictive model using metrics like Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and R² to validate and enhance performance.
These efforts will culminate in a final report that, while primarily an exercise and not applicable in real-world contexts, highlights our findings and offers insights into the most influential lifestyle factors affecting obesity. This analysis aims to provide actionable recommendations within a simulated scenario, illustrating how data-driven insights could support public health strategies focused on obesity reduction.
1.1 Next Steps
Outline the next steps planned for completing the project, such as refining analyses, adding new methods, or addressing outstanding data issues.
1.2 Final Thoughts
Briefly reflect on any challenges or limitations encountered so far and how these might be addressed in the final report.
Source Code
---
title: Project Update Report (Group G): Code and Structure
author:
  - Dorofieiev, Illia
  - Pizzi, Alessandro
  - Lovato, Andrea
  - El Abed, Ayman
institute: University of Lausanne
date: today
title-block-banner: "#0095C8" # chosen for the university of lausanne
toc: true
toc-location: right
format:
  html:
    number-sections: true
    html-math-method: katex
    self-contained: true
    code-overflow: wrap
    code-fold: true
    code-tools: true
    include-in-header: # add custom css to make the text in the `</> Code` dropdown black
      text: |
        <style type="text/css">
        .quarto-title-banner a {
          color: #000000;
        }
        </style>
  pdf: # use this if you want to render pdfs instead
    include-in-header: # wrapping the code also in the pdf (otherwise, it overflows)
      text: |
        \usepackage{fvextra}
        \DefineVerbatimEnvironment{Highlighting}{Verbatim}{
          commandchars=\\\{\},
          breaklines, breaknonspaceingroup, breakanywhere
        }
abstract: |
  This is the abstract of the report. It should be a short summary of the project, the data, the analysis and the results. It should be concise and to the point. It should not be longer than 250 words.
---

```{r}
#| label: setup
#| echo: false
#| message: false

# loading all the necessary packages
source(here::here("src", "setup.R"))
```

::: {.callout-tip}
### How to include sections separately
- You can use `{include X}` to include different sections of your report as separate `.qmd` files. This is also well documented in the Quarto documentation: <https://quarto.org/docs/authoring/includes>
- As mentioned in the documentation, we have used (_) prefix for the included files (e.g., `_introduction.qmd` and `_data.qmd`). You should always use an underscore prefix with included files so that they are automatically ignored (i.e. not treated as standalone files) by a quarto render of a project (not absolutely necessary in your case, but highly recommended).
- Rendering only `report.qmd` will render also all the other files.
:::

{{< include sections/_introduction.qmd >}}
{{< include sections/_data.qmd >}}
{{< include sections/_eda.qmd >}}
{{< include sections/_analysis.qmd >}}
{{< include sections/_conclusion.qmd >}}